Predicting age at onset of type 1 diabetes in children using regression, artificial neural network and Random Forest: A case study in Saudi Arabia

The rising incidence of type 1 diabetes (T1D) among children is an increasing concern globally. A reliable estimate of the age at onset of T1D in children would help medical practitioners plan interventions that reduce the problems associated with delayed diagnosis of T1D. This paper utilises Multiple Linear Regression (MLR), Artificial Neural Network (ANN) and Random Forest (RF) to model and predict the age at onset of T1D in children in Saudi Arabia (S.A.), which ranks 7th in the world for the number of T1D cases and 5th for the incidence rate of T1D. De-identified data from 2010 to 2020 from three cities in S.A. were used to model and predict the age at onset of T1D. The best subset model selection criteria, coefficient of determination, and diagnostic tests were deployed to select the most significant variables. The efficacy of models for predicting the age at onset was assessed using multiple prediction accuracy measures. The average age at onset of T1D is 6.2 years and the most common age group for onset is 5-9 years. Most of the children in the sample (68%) are from urban areas of S.A., 75% were delivered after a full-term pregnancy and 31% were delivered through a cesarean section. The models of best fit were the MLR and RF models, with R² = 0.85 and 0.95, root mean square error = 0.25 and 0.15, and mean absolute error = 0.19 and 0.11, respectively, for the logarithm of age at onset. This study is the first to utilise MLR, ANN and RF models to predict the age at onset of T1D in children in S.A. These models can aid health care providers in monitoring T1D and creating intervention strategies to reduce its impact in children in S.A.

Introduction
Type 1 diabetes (T1D) is a metabolic disorder generally recognised as the result of an autoimmune response that affects insulin-producing β cells in the pancreas, leading to extreme insulin deficiency and associated hyperglycemia [1]. Poor glycemic control in T1D can lead to diabetic ketoacidosis, which may result in significant neurological complications, hospitalization and death [2,3]. T1D can cause long-term complications such as blindness from retinopathy [3] and kidney failure [4]. In addition, chronic diseases such as diabetes can disturb physiology, impacting linear growth and pubertal development [5]. T1D can also cause severe dermatological complications [6]. The incidence rate of T1D worldwide increases by about 3% to 4% per year [7]. The 9th edition of the International Diabetes Federation (IDF) Atlas (2019) estimated that 600,900 children and adolescents under the age of 15 years worldwide are living with T1D [8]. Moreover, it is estimated that more than 98,000 children and adolescents under the age of 15 years are diagnosed with T1D annually, and that number increases to 128,900 when the age range is extended to 20 years [8]. According to the IDF (2019), Saudi Arabia has a high incidence rate of new cases of T1D in children and adolescents (<15 years of age) at 31.4 cases per 100,000 children each year, which places it 5th highest worldwide [8]. The same edition reported the number of new cases of T1D among children under 15 years to be 2,800 [8].
This highlights the urgent need for an improved understanding of the development of T1D in Saudi Arabian children to facilitate improved monitoring to reduce the complications of delayed diagnosis. In addition, this may assist the practicality of early intervention trials, to increase the chance of disease mitigation prior to the onset of dysglycemia to retain a greater number of functional islet cells.

Literature survey
The onset of T1D is affected by multiple genetic and environmental risk factors [9][10][11][12]. The roles of the environment and genetics in the development of T1D have been recognised for more than 40 years [13]; however, the work of determining the environmental and perinatal risk factors of T1D is ongoing [10,11,14]. Risk factors shown to be associated with T1D onset include childhood infections, diet, family history of diabetes [10,11,15] and perinatal factors [16], although the relative contribution of each factor has not been clearly determined [10]. Published studies related to modelling age at onset of T1D were reviewed and the issues in the context of the proposed paper are summarised in S2 Table. It has been reported that preterm birth (<37 weeks) [17] and birth weight [17,18] were associated with earlier onset of T1D in children; however, they were not significant risk factors for developing T1D [19]. Early T1D onset in children was related to a family history of T1D [20,21], in siblings [22] or in parents [19], but maternal diabetes was significantly associated with an older age of onset [22]. A maternal age of 25-29 years and a paternal age of ≥30 years were both identified as risk factors for early T1D in children in [19], whereas paternal age greater than 25 years was not found to be a significant risk factor in [20], nor was maternal age at delivery of ≥35 years in [19]. A study conducted in the UK [23] reported that in consanguineous pedigrees, Wolcott-Rallison syndrome is the most common cause of chronic neonatal diabetes. Also, children who were younger at the time of diagnosis tended to be heavier [24][25][26] and taller [24,26]. Other risk factors such as season of birth, year of birth, gestational age size [17], mixed feeding, children's prior history of infections [21], cesarean section, gestational diabetes, pre-eclampsia [19] and maternal weight at childbirth [27] were linked to early onset of T1D in children.
Gender, ethnicity, and a history of autoimmune disease in the family [17,21], higher birth order, multiple bacterial infections, and residing in high population density areas were not associated with early onset of T1D [19]. In addition, studies from Saudi Arabia that have examined factors contributing to T1D in children have indicated that the incidence of childhood T1D may be associated with vitamin D deficiency [28,29]. The possibility of preventing, delaying or reducing the complications of T1D in children is therefore an important area of research [3,30,31]. Preventative studies [32,33] have shown that eliminating cow's milk proteins from infant formula (in the Finnish TRIGR pilot research [32]) or eliminating bovine insulin from infant formula (in the FINDIA study [33]) reduced the production of islet autoantibodies. However, existing T1D research, such as the studies conducted in Sweden and Finland [34,35], does not represent the ethnicity and diversity of the Saudi Arabian population. This study aims to fill the gap by developing predictive models to estimate the age at onset of T1D using data from Saudi Arabia.
Further studies have investigated methods for predicting the age at onset of T1D. Early exposure to respiratory infections was shown to carry a higher risk of autoantibody seroconversion in children with a family history of T1D [36]. This was identified through ongoing monitoring of islet autoantibodies during their first 3 years of life [36]. Longitudinal autoantibody measurements have also been used as a risk predictor in families that have a first-degree relative with T1D [37], in the general population [38] and in individuals identified as being at risk [37,39,40]. Also, genetic factors and genetic risk scores were used to identify the presence of islet autoantibodies in children with high-risk HLA genotypes [41][42][43][44]. Examination of metabolic changes indicates that post-challenge C-peptide levels start to drop significantly six months before diagnosis [45]. A combined risk score model including clinical, genetic, and immunological characteristics, created for high-risk children who were followed from birth until 9 years, showed significantly improved T1D prediction compared to autoantibodies alone [46]. However, beyond the above research there is a lack of application of machine learning methods for developing models of age at onset of T1D, even though many studies have utilised various machine learning methods for type 2 diabetes [47-49]. Hence, the proposed work differs from previous research in that it models the age at onset of T1D in children by using statistical and machine learning models to identify the risk factors and create a predictive model.

Motivation and the objective of the proposed research
Saudi Arabia has an increasing incidence rate of T1D in children; it ranks 7th in the world for the number of T1D cases and 5th for the incidence rate of T1D. Despite the remarkable increase in the incidence of childhood T1D in Saudi Arabia, there is a lack of meticulously conducted research on T1D in children in Saudi Arabia compared with developed countries [50,51]. In addition, most of the published research on T1D in children in Saudi Arabia is cross-sectional, has small sample sizes and involves a single center and a single city or region of the country [50]. Consequently, prior studies do not accurately represent the country's large and diverse population. Hence, it is important to carry out research on modeling the age at onset of T1D, with the aim of reducing the problems associated with delayed diagnosis of T1D in Saudi Arabia. This will both support the improvement of the health of the nation and add to the current research on T1D in diverse populations, while recognising the lack of T1D studies of Saudi Arabian children. Existing T1D research conducted in Saudi Arabia has examined different aspects of T1D in children but has not modelled the age at onset of T1D in children. This study aims to fill the gap by utilising a secondary data source, curated specifically for this study, to develop the most suitable model for predicting age at onset of T1D in children. To the best of our knowledge, no previous studies have modelled the age at onset of T1D in children in Saudi Arabia and identified the risk factors. In addition, no study in the literature compares multiple linear regression (MLR), artificial neural network (ANN) and Random Forest (RF) for modelling the age at onset of T1D in children <15 years old using local data.
The results of this study indicate that the MLR and RF models outperform the ANN model for the age group <15 years. However, the RF model performs better than the MLR and ANN models for the most common age group (5-9 years) based on the coefficient of determination (R²), root-mean-square error (RMSE) and mean absolute error (MAE).

Data and methods
This section outlines the data collection, the development and evaluation of the prediction models using multiple linear regression (MLR), artificial neural network (ANN) and Random Forest (RF) methodologies.

Data collection
De-identified data for 359 individuals were collected from medical files at three diabetes clinics located in three major cities of Saudi Arabia. Ethical approval was obtained from the RMIT University Human Research Ethics Committee in Australia and the Research Ethics Committee of the Ministry of Health in Saudi Arabia. The need for informed consent was waived by the ethics committee as this was a retrospective study of medical records. All data were fully anonymized before analysis. An overview of the demographic breakdown of the data is given in Fig 1. Additional data were collected on demographics (gender, residency, consanguineous parents, birth weight, birth year and birth order), pregnancy factors such as gestational age in weeks and mode of delivery (normal delivery/caesarean section), clinical history such as family history of diabetes and the child's weight and height, and maternal characteristics such as maternal age at the child's birth. Pregnancy length was grouped based on the gestational weeks used by the World Health Organisation (WHO) and the American College of Obstetricians and Gynaecologists Committee on Obstetric Practice and Society for Maternal-Foetal Medicine [52, 53]: ≤36 (preterm), 37-38 (early term), 39-40 (full term), 41 (late term) and ≥42 (post term).
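The pregnancy length grouping described above can be sketched as a small lookup function. This is only an illustration: the function name is hypothetical, and it assumes the lowest category covers 36 completed weeks or fewer and the highest covers 42 weeks or more.

```python
def pregnancy_length_group(weeks: int) -> str:
    """Map completed gestational weeks to one of the five categories in the text.

    Assumption: the boundary categories are "36 weeks or fewer" (preterm)
    and "42 weeks or more" (post term).
    """
    if weeks <= 36:
        return "preterm"
    if weeks <= 38:
        return "early term"
    if weeks <= 40:
        return "full term"
    if weeks == 41:
        return "late term"
    return "post term"


# Illustrative values only, not taken from the study data.
for w in (35, 37, 40, 41, 42):
    print(w, pregnancy_length_group(w))
```

Encoding the grouping as an ordered set of thresholds keeps the categories mutually exclusive, which matters when the variable is later used as a factor in a regression model.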

Model development
This study models the age at onset of T1D for the full cohort and, in a focused analysis, for the most common age group at T1D diagnosis (5-9 years).
As previously shown in the literature [17][18][19][20][21][22][23][24][25][26], the child's gender, having a family history of T1D, pregnancy length, birth year, birth weight, residency, mode of birth delivery, consanguineous parents, maternal age at birth, birth order, elevated weight at diagnosis and height at diagnosis could influence the age at onset of T1D, and these variables were included in the analysis. The efficacy of the models was assessed using the coefficient of determination (R²), root-mean-square error (RMSE) and mean absolute error (MAE).
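For reference, the three accuracy measures can be written out as follows. This is a minimal sketch with hypothetical function names and toy numbers; it assumes R² is computed as one minus the ratio of residual to total sum of squares, which may differ from the exact definition used by the software in this study (for example, a squared correlation).

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Toy values for illustration only (not study data).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(r_squared(observed, predicted), rmse(observed, predicted), mae(observed, predicted))
```

RMSE and MAE are expressed in the units of the modelled response, so values obtained for age at onset, its square root and its logarithm are each on a different scale.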
MLR, ANN and RF models have been used in many studies to describe different systems [54][55][56][57][58][59][60][61], and hence were chosen to model the age at onset of T1D in this cohort. The statistical software R was used to perform the analysis [62].
Multiple linear regression (MLR). MLR [63] is used in this study as a prediction technique to model age at onset of T1D based on the independent variables suggested from the literature and additional variables collected in this study. The general MLR model is defined by the following equation:
$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon $$
where y is the dependent variable (age at onset), β0 is the intercept, β1, …, βk are the regression coefficients for the independent variables or interaction terms, x1, …, xk are the independent variables and ε is the residual term of the model.
Artificial Neural Network (ANN). Advances in machine learning in the medical area have recently provided new opportunities in the field of disease prediction and prescription treatment [64]. ANN is a branch of the wider field of machine learning. It is one of the well-known prediction approaches used for finding a solution when other statistical methods cannot be effectively utilized. The benefits of this method, including the ability to learn from instances, fault tolerance and non-linear data forecasting, make it a suitable statistical method [65]. One of the major benefits of ANN is its ability to distinguish hidden linear and nonlinear relationships, often in high-dimensional and complex data sets [66]. An ANN consists of input, hidden and output layers made up of units known as neurons [67] (Fig 2). The number of input layer neurons represents the number of variables which describe the features being evaluated, whereas the output layer neuron is the dependent variable. The number of hidden layers and the number of neurons depend on the quantity of data and the complexity of the relationship between the input and output layers. Every neuron in the hidden and output layers is linked by a corresponding numerical weight to all neurons in the preceding layer [68].
Random Forest (RF). Decision trees have become a very common machine learning tool in recent years due to their simplicity, ease of use and interpretability [69]. Various studies have been performed to address the limitations of conventional decision trees, such as lack of robustness and suboptimal performance [70]. One of the most useful techniques to result from these studies is the development of an ensemble of trees followed by a vote for the most common class [71]. Random forest (RF) is an ensemble learning approach in which the output of a number of weak learners, each of which may be a single decision tree, is improved through a voting scheme similar to that of other ensemble learning methods [54]. Since RF has a built-in feature selection method, it can handle a large number of input variables without the need to reduce dimensionality, and overfitting can be controlled by using out-of-bag validation [72].
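The out-of-bag idea mentioned above can be illustrated without fitting any trees. The sketch below, using hypothetical helper names, draws one bootstrap sample of record indices (as would be done for each tree) and lists the records left out of that sample; those left-out records are what an out-of-bag estimate evaluates the tree on. For a regression response such as age at onset, the tree outputs are typically averaged rather than voted on.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n record indices with replacement (the in-bag sample for one tree)."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, in_bag):
    """Indices never drawn into the bootstrap sample (the out-of-bag records)."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
n = 359                     # cohort size reported in the text
in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)
oob_fraction = len(oob) / n  # for large n this tends towards 1/e, roughly 0.37
print(len(oob), round(oob_fraction, 2))
```

Because roughly a third of the records are unseen by any given tree, the forest carries its own internal validation set, which is why out-of-bag error is often used alongside a separate test split.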

Descriptive statistical analyses of cohort of T1D in Saudi Arabia
The trend of the reported cases of children with T1D between 2010 and 2020 is shown in Fig 3. There was an upward trend in the number of reported cases in these centres during this period. As shown in Table 1, the mean age at onset of T1D in this cohort was 6.2 years with a standard deviation of 3.28. For males (n = 149), the mean age at onset was 6.1 years with a standard deviation of 3.18, while for females (n = 210) it was 6.3 years with a standard deviation of 3.35. The median and mode for the full cohort were 6 and 8.1 years, respectively. The maximum age at diagnosis was 13.9 years and the minimum was 1 month (Table 1). The mean age at onset was slightly higher for females than for males, but the difference was not statistically significant (t = 0.56909, p = .569) (Table 2). In addition, there is a significant interaction between gender and city, with a difference for males in Riyadh compared to males in Jeddah (Table 3). The full ANOVA results for the interaction between gender and city are provided in S2 Table. Also, as shown in Fig 4, the distribution of the age at onset of T1D was not normal (Shapiro-Wilk test, p-value < 0.001, kurtosis = -0.808). The mean height of this group was (1.20 ± 0.20) metres, the mean weight was (22.6 ± 11.02) kg and the mean birth weight was (2.90 ± 0.64) kg. Birth years in this cohort range from 2003 to 2020, with a median of 2010. The sample contains more females (210) than males (149). In addition, Fig 5 illustrates that females had a higher incidence of T1D than males over this period. Cases were distributed across the three sites, with 28% from Al-Ahsa, 44% from Jeddah and 28% from Riyadh. The majority of the sample (68%) are from urban areas of Saudi Arabia.
Almost half of the cases (47%) have consanguineous parents and 49% of all cases have a family history of diabetes. The majority (75%) were delivered after a full-term pregnancy and 31% were delivered through a caesarean section. In total, 32% of cases were the firstborn child and 57% were born to mothers aged between 25 and 35 years.
The distribution of age and gender also differs across the three cities (Fig 6). In Al-Ahsa and Riyadh, the majority of patients are females in all age groups, whereas in Jeddah the number of males was higher than the number of females in the age group 5-9 years. The figure also shows that the most common age group for onset of T1D in the three cities is 5-9 years.

Multiple linear regression modeling (MLR)
We have developed MLR models based on the dependent variable age at onset (y) and also investigated transformations of the dependent variable: the square root of y and the logarithm of y. To improve the efficacy of the MLR models, interactions between independent variables were considered. The MLR models with interactions were selected using step-wise selection based on the smallest Akaike's Information Criterion (AIC). Table 4 lists the variables in each MLR model. Table 5 shows the results of the MLR models together with their corresponding R², adjusted R², RMSE and MAE. In Table 5, the best model was MLR model M6, which contains independent variables in addition to the interactions between variables shown in Table 10. Comparison of the MLR models in Table 5 shows that transforming the dependent variable decreases the values of RMSE and MAE. Fig 7 shows the observed values versus fitted values for the MLR models. The plot shows that MLR model M6, of the logarithm of age at onset of T1D with interactions between variables (panel c2), was the best model based on the values of R² and adjusted R² and the smallest values of RMSE and MAE.

PLOS ONE
To further improve the MLR models and address potentially influential outliers, cases with an age at onset of T1D of less than one year were removed from further analysis. The results in Table 6 and Fig 8 indicate that the best model was still M6, the model of the logarithm of age at onset of T1D, which achieves an R² of 0.85 and the smallest values of RMSE and MAE (0.25 and 0.19). Fig 8 shows the observed values versus fitted values for the MLR models without and with interactions after removing outliers.
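The AIC comparison that underlies the step-wise selection in this section can be illustrated with a short sketch. It assumes Gaussian errors and uses a common form of the criterion for least-squares fits, n ln(RSS/n) + 2k, which is defined up to an additive constant. The function name, residual sums of squares and parameter counts below are hypothetical and are not taken from Tables 4 to 6.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC for a least-squares fit, up to an additive constant.

    rss      : residual sum of squares of the fitted model
    n_obs    : number of observations
    n_params : number of estimated coefficients (including the intercept)
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


# Hypothetical candidates: (RSS, number of coefficients). Not study results.
candidates = {
    "main effects only": (40.0, 13),
    "with interactions": (25.0, 30),
}
n_obs = 359

aic_by_model = {name: gaussian_aic(rss, n_obs, k) for name, (rss, k) in candidates.items()}
best = min(aic_by_model, key=aic_by_model.get)  # smallest AIC is preferred
print(best)
```

The 2k term penalises each added coefficient, so an interaction is retained only when it reduces the residual sum of squares by more than the penalty it incurs.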

Artificial Neural Network modeling (ANN)
Table 4. Independent and interaction variables in selected MLR models.

Data were randomly divided into two subsets: 80% for training, used to build the ANN models, and 20% for testing, used to assess the validity of the ANN. For ANN, there is no general rule for determining the number of neurons in the hidden layers [73]. The number of hidden layer neurons varies from problem to problem and depends on the number and quality of training patterns [74]. The chosen model has an input layer with 13 inputs, two hidden layers with 13 neurons and an output layer with one output. All independent variables (X's) were used as input data for the ANN. To assess the effect of the hidden layers on the neural network output, the number of neurons in a hidden layer was varied. A higher number of neurons did not make a significant difference in the performance of the ANN models, so 13 neurons were chosen to reduce the complexity of the network (Fig 9). Table 7 shows the results of the ANN models.
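To make the architecture described above concrete, the following sketch builds a network with 13 inputs, two hidden layers and one output, counts its weights and biases, and runs one forward pass. Several points are assumptions rather than details from the text: each hidden layer is taken to have 13 neurons, the hidden activation is logistic, the output is linear, and the weights are random placeholders rather than trained values.

```python
import math
import random

# 13 inputs, two hidden layers (assumed 13 neurons each), one output.
LAYER_SIZES = [13, 13, 13, 1]


def init_network(layer_sizes, rng):
    """Return one (weights, biases) pair per layer transition, with placeholder weights."""
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def count_parameters(layers):
    """Total number of connection weights plus biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def forward(layers, inputs):
    """Propagate one input vector through the network and return the scalar output."""
    activations = inputs
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        # Logistic activation in hidden layers (assumption); linear output for regression.
        activations = z if idx == last else [1.0 / (1.0 + math.exp(-v)) for v in z]
    return activations[0]


rng = random.Random(42)
network = init_network(LAYER_SIZES, rng)
n_params = count_parameters(network)
prediction = forward(network, [0.0] * 13)
print(n_params)
```

Under these assumptions the network has several hundred free parameters, which is of the same order as the number of training records, so keeping the hidden layers small is a reasonable way to limit complexity.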
Comparing the results for the testing data, the ANN models for age at onset, square root of age at onset and logarithm of age at onset have R² of (0.77, 0.72 and 0.73), RMSE of (2.53, 0.59 and 0.52) and MAE of (2.06, 0.51 and 0.44) respectively, as shown in Table 7. Therefore, based on R², the ANN model A1 of age at onset of T1D outperformed the other ANN models. Fig 10 shows the plots of observed values versus predicted values of the ANN models for the training and testing data.

Random Forest modeling (RF)
Data were randomly divided into two subsets: 80% for training, used to build the RF models, and 20% for testing, used to assess the validity of the models. In the RF models, the number of trees was set at 500 and the number of variables considered at each node split was set at 4. The results of the RF models are shown in Table 8. The RF models for age at onset, square root of age at onset and logarithm of age at onset all have R² = 0.95 for the training data, while the testing data had R² of (0.87, 0.89 and 0.89) respectively. As the square root and logarithm models both have the higher testing R² of 0.89, the logarithm of age at onset was chosen as the best RF model (RF3) based on its smaller RMSE and MAE. Fig 11 shows the plots of observed values versus predicted values of the RF models for the training and testing data.
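The random 80/20 partition used for the ANN and RF models can be sketched as below. The function name and the seed are hypothetical, and the record identifiers are placeholders; the sketch only shows the mechanics of a reproducible split, not the partition actually used in the study.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    rng = random.Random(seed)           # fixed seed makes the split reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(359))              # placeholder identifiers for the 359 cases
train, test = train_test_split(records)
print(len(train), len(test))
```

With a cohort of this size the test set contains only a few dozen records, so performance figures from a single split can vary noticeably with the random seed; repeating the split or using cross-validation is one common way to gauge that variability.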

Model validation
Validation of the best MLR, ANN and RF models was conducted based on their corresponding R², RMSE and MAE when applied to the test data set. The results are summarized in Table 9 and show that MLR and RF outperform the ANN model, with R² = 0.88 and 0.89, RMSE = 0.22 and 0.21, and MAE = 0.18 and 0.17, respectively. In addition, the selected MLR and RF models both show a high accuracy of 0.99.

Age at onset of T1D, considering environmental factors and family history of diabetes
To assess the impact of environmental factors on the age at onset of T1D the selected MLR, ANN and RF models (M6, A1 and RF3) were utilized.

Risk factors of age at onset of T1D based on MLR
The analysis of the variables that influence the age at onset of T1D based on the best MLR model, with their corresponding P-values and 95% confidence intervals, is provided for the full model in Table 10 (non-significant interaction variables were excluded for brevity).


Selection of the significant variables based on ANN models
Olden's algorithm [75] computes, for each input, the product of the raw connection weights linking that input to the output neuron through each hidden neuron, and sums these products over all the hidden neurons. A benefit of this method is that the relative contribution of each connection weight is retained in terms of both magnitude and sign. This algorithm was applied to the ANN model to investigate the relative importance of each variable, and the result is shown in Fig 12. The figure reveals that some variables have a positive, and some have a negative, relationship with the age at onset of T1D. The results suggest that birth delivery mode has the strongest positive relationship and consanguineous parents has the strongest negative relationship. Variables with relative importance close to zero, such as having a family history of T1D, do not have any substantial importance for the response variable age at onset.
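The connection-weight calculation described above can be sketched for the simplest case of a single hidden layer. The function name and the weights are hypothetical and chosen only so that the three inputs show a positive, a negative and a near-zero importance. For a network with two hidden layers, as used in this study, the same idea extends by multiplying the weights along every path from an input to the output and summing over paths.

```python
def olden_importance(input_hidden, hidden_output):
    """Connection-weight importance for a network with one hidden layer.

    input_hidden[j][i] : weight from input i to hidden neuron j
    hidden_output[j]   : weight from hidden neuron j to the output neuron
    Returns one signed importance value per input.
    """
    n_inputs = len(input_hidden[0])
    n_hidden = len(hidden_output)
    return [
        sum(input_hidden[j][i] * hidden_output[j] for j in range(n_hidden))
        for i in range(n_inputs)
    ]


# Hypothetical weights for 3 inputs and 2 hidden neurons (not from the fitted ANN).
input_hidden = [
    [1.0, -1.0, 0.1],
    [0.5, 2.0, 0.2],
]
hidden_output = [2.0, -1.0]

importance = olden_importance(input_hidden, hidden_output)
print(importance)
```

Because the raw products are summed without taking absolute values, contributions of opposite sign can cancel, which is how an input can end up with an importance near zero even when its individual weights are not small.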


Selection of the significant variables based on RF models
RF was used to identify the important variables that influence the age at onset of T1D (Fig 13). The figure shows that the child's weight and height at diagnosis, along with the birth year are the most important variables for the response variable followed by birth weight.

Modeling age at onset of T1D for 5-9 years age group
It has previously been reported that the incidence rate of T1D cases in Saudi Arabia was higher for the age group ≥5 years than for the age group <5 years [76,77]. A similar conclusion was reached in Italy [21]. This is also observed in our cohort, as shown in Figs 1 and 6, where the most common age for children with T1D is between 5 and 9 years. This section utilizes MLR, ANN and RF to model age at onset for this specific age group. The sample size for this age group is 170 children from three different cities of Saudi Arabia. The group has a mean age at onset of 6.9 years, a standard deviation of 1.4 years, a median of 6.9 years and modes at 6.1 and 8.1 years. The mean height of this group was (1.23 ± 0.11) metres, the mean weight was (23.2 ± 7.98) kg and the mean birth weight was (2.96 ± 0.68) kg. The median birth year in this group was 2010. Table 11 shows the description of the data for this age group.

Multiple linear regression models for 5-9 years age group
MLR models of age at onset of childhood T1D have been developed for this age group. The MLR modeling for the full cohort showed that interactions improved the performance of the MLR models, and hence interactions have also been examined in this section; Table 12 lists the independent and interaction variables in each MLR model. Table 13 shows the results of the MLR models for this age group without and with interactions between variables. Based on R², adjusted R², RMSE and MAE, the MLR model M12, using the logarithm of age at onset with interactions between variables, was the best model. In Table 13, the MLR models without interactions have an R² of 0.44, with RMSE of (1.02, 0.19 and 0.15) and MAE of (0.83, 0.16 and 0.12) for age at onset, square root of age at onset and logarithm of age at onset respectively. A small improvement in the MLR models was observed by adding interactions. In the models with interactions, MLR models M10-M12, R² values were (0.64, 0.63 and 0.64), RMSE values were (0.82, 0.16 and 0.12) and MAE values were (0.67, 0.13 and 0.10) respectively. Fig 14 displays the plots of observed values versus fitted values for the MLR models without and with interactions between variables for the 5-9 years age group.
From Table 13, the best model of age at onset of T1D was the logarithm model with interactions between variables (c2). However, the MLR for this age group has not performed as well as the MLR for the full cohort.

Artificial neural network models for 5-9 years age group
ANN models were also utilized to model the age at onset of T1D in the age group (5-9). Data in this age group were randomly divided into two subsets in a ratio of 80% for training to build the ANN models and 20% to use as a testing set to validate the models. The input layer has 13 inputs, two hidden layers were used with 13 neurons and the output layer has one output as with the full cohort. Table 14 shows the results of ANN models in the age group of 5-9 years.
The ANN models for age at onset of T1D have R² values of (0.16, 0.12 and 0.14), RMSE values of (2.19, 0.71 and 0.49) and MAE values of (1.85, 0.67 and 0.45) respectively for models A4-A6. The low R² values of the ANN models indicate a poor fit to the data for age at onset of T1D in this age group. Fig 15 shows the observed values versus predicted values for the ANN models of the 5-9 years age group.

Random Forest models for 5-9 years age group
RF models were also used to model the age at onset of T1D in this age group. Data were divided into two subsets: 80% for training, used to build the RF models, and 20% for testing, used to validate the models. Table 15 shows the results of the RF models in this age group.
In Table 15, the RF models for age at onset of T1D have R² values of (0.77, 0.78 and 0.78), RMSE values of (0.59, 0.11 and 0.08) and MAE values of (0.51, 0.09 and 0.07) respectively for models RF4-RF6. Therefore, the best RF model was RF6, which again uses the logarithm of age at onset. The RF models have higher values of R² than the MLR and ANN models in this age group. Fig 16 shows the observed values versus predicted values for the RF models of the 5-9 years age group for both training and testing subgroups.

Validation of models for the 5-9 years age group
This section assesses the efficacy of the selected MLR, ANN and RF models. Data were divided into two subsets of 80% for training and 20% for use as a testing set. The values of R², RMSE and MAE were used to assess the efficacy of the models. The results presented in Table 16 indicate that the RF model is the best fit for the data. From Table 16, the RF model has a higher R² (0.78) than the MLR and ANN models. Also, the RMSE and MAE values for RF were 0.08 and 0.07, which are smaller than those of the MLR and ANN models. Both the MLR and RF models have the highest accuracy of 0.99. These results indicate that the RF method outperforms MLR and ANN in describing the age at onset of T1D for ages between 5 and 9 years, achieving lower RMSE and MAE with a higher R².

Risk factors impacting the age at onset of T1D in (5-9) age group based on MLR, ANN and RF models: Environmental factors and family history of diabetes
Similar to the full cohort, we have used MLR, ANN and RF methods to assess the impact of environmental factors on the age at onset of T1D for the (5-9) age group.
The analysis of the variables that influence the age at onset of T1D in this age group based on the best MLR model together with their corresponding P-value and 95% confidence interval is provided in Table 17 (non-significant interaction variables were excluded for brevity).
Olden's algorithm is also used in this section to identify the relative importance of variables in the age group (5-9) when utilizing the best performing ANN model (Fig 17). The figure reveals that birth year and gender have the strongest positive and negative relationship respectively with the response variable. Similarly, variables that have relative importance close to zero, such as maternal age at child's birth does not have any substantial importance for the response variable. Fig 15. Plots of training and testing data of ANN models for the 5-9 years age group. (a1,a2) y: age at onset, (b1,b2) y: the square root of age at onset, (c1,c2) y: the logarithm of age at onset.

Fig 16. Plots of training and testing data of RF models for the 5-9 years age group. (a1,a2) y: age at onset, (b1,b2) y: the square root of age at onset, (c1,c2) y: the logarithm of age at onset. https://doi.org/10.1371/journal.pone.0264118.g016

RF was also used to identify the important variables in the (5-9) age group (Fig 18). The figure indicates that weight and height at diagnosis are the most important variables for the response variable, followed by birth year and birth weight.

Discussion
De-identified data on 359 children with T1D, collected from three cities, have been analysed to obtain an insight into the distribution of the age at onset of T1D in Saudi Arabia. The analysis shows an overall upward trend in the incidence of T1D in children from 2010 to 2020, with a higher incidence in females. This was also reported for other populations from the Middle East and North Africa (MENA) region [78,79]. To identify the factors showing the strongest relationship to the age at onset, models derived with MLR, ANN and RF were compared. Using the best subset model selection criteria, the coefficient of determination and diagnostic tests of residuals, the most significant independent variables were identified as: city, pregnancy length, consanguineous parents, birth weight, birth year, and the child's weight and height at diagnosis. The efficacy of the models for predicting the age at onset was assessed using multiple prediction accuracy measures: the coefficient of determination (R2), root mean square error (RMSE) and mean absolute error (MAE). To improve the efficacy of the MLR models, interactions between independent variables were considered. The MLR models were selected by step-wise selection based on the smallest Akaike's Information Criterion (AIC).
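To illustrate how the AIC trades goodness of fit against model size during step-wise selection, the sketch below uses the common least-squares form AIC = n ln(RSS/n) + 2k, which omits an additive constant shared by all models fitted to the same data. The residual sums of squares and parameter counts are hypothetical, and statistical software may count parameters slightly differently.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC for a least-squares fit, up to a constant common to all candidate models."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


# Hypothetical candidates fitted to the same 100 observations.
# The larger model reduces the residual sum of squares only slightly.
aic_small = gaussian_aic(rss=20.0, n_obs=100, n_params=4)
aic_large = gaussian_aic(rss=19.5, n_obs=100, n_params=7)

preferred = "small" if aic_small < aic_large else "large"
print(aic_small, aic_large, preferred)
```

In this toy comparison the extra terms do not reduce the residual error enough to offset their penalty, so the smaller model has the lower AIC and would be retained.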

Modeling age at onset of T1D
In all MLR, ANN and RF models, different transformations of the age at onset of T1D were considered to find the best model for prediction. The study found that the logarithm of age at onset of T1D was the best choice for the dependent variable when using both the MLR and RF methods. The analyses showed that MLR model M3 and RF model RF3 outperformed the ANN model, with higher R2 (0.88 and 0.89), smaller RMSE (0.22 and 0.21) and smaller MAE (0.18 and 0.17), respectively.
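Because the best models predict the logarithm of age at onset, predictions must be back-transformed to obtain ages in years, and errors on the log scale are best read as multiplicative factors. The sketch below assumes the natural logarithm was used, which is not stated explicitly here; the predicted values are hypothetical.

```python
import math


def back_transform(log_predictions):
    """Convert predictions of ln(age at onset) back to ages in years."""
    return [math.exp(value) for value in log_predictions]


# Hypothetical model outputs on the natural-log scale.
log_predictions = [math.log(4.0), math.log(6.2), math.log(9.0)]
ages_in_years = back_transform(log_predictions)

# An error of 0.22 on the log scale corresponds to a multiplicative factor
# on the original scale rather than a fixed number of years.
error_factor = math.exp(0.22)
print(ages_in_years, error_factor)
```

Under this assumption, an RMSE of 0.22 on the log scale corresponds to predictions that are typically within a factor of about 1.25 of the observed age.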

The impact of environmental factors and family history of diabetes on the age at onset of T1D
The results of this study support previous findings that the age at onset of T1D can be influenced by environmental factors [17,21]. The result of the MLR model presented here agrees with previous studies conducted in Israel and Australia [17,80]. It is shown that the birth year (p-value = 0.001) [17] and preterm birth before 37 weeks [17,80] can influence the age at onset of T1D. In our study, the weight and height of the child were found, based on the RF and MLR results, to be associated with age at onset of T1D, in agreement with [24,25].
In this study, pre-term birth was identified as significant in the MLR model only, whereas an Austrian study found that moderately preterm birth was significant for age at onset of T1D [81]. The MLR model for these data did not show a significant contribution based on family history, in contrast with previous studies [20][21][22].

Modeling age at onset of T1D for the age group (5-9)
In this study, we have also focused on creating a model for the (5-9) age group because it was the most common age group for onset of T1D in the full data analysis (Fig 6) and as reported previously in the literature [21,76,77].
Similar to the full data, the best MLR was the model based on the logarithm of age at onset. The interaction between the variables did not perform well on this subgroup. In the RF models, the logarithm of age at onset was also the best choice for the data. Comparison of the methods based on the values of the R 2 , RMSE and MAE achieved shows that RF outperforms the MLR and ANN in describing age at onset of T1D in this cohort.
This study shows that consanguineous parents and gender can influence the age at onset of T1D. Furthermore, it also indicates that birth weight, birth year, weight and height can influence the age at onset of T1D in the (5-9) age group.
The use of both traditional multiple linear regression and machine learning approaches should not be regarded as in conflict when the aim is prediction [82]. Although MLR relies on strong assumptions, including the type of error distribution and the additivity of the parameters [82], it has the advantage that the underlying biological relationship is simple to understand, whereas the results of ANN and RF are often difficult to interpret [82,83]. ANN and RF analyses provide feature importance, but do not provide the complete visibility of coefficients that linear regression does. However, they may help in understanding the intricate relationships between inputs and determining their impact on the main outcome. They are flexible and free from a priori assumptions. RF has the advantages of a built-in feature selection method, the ability to handle many input variables without the need to reduce dimensionality, and control of overfitting through out-of-bag validation. Another potential disadvantage of machine learning methods is computational complexity. ANN computation is complex and time-consuming, depending on the type of features used, the number of nodes and layers in the neural network and the amount of training data [84]. For large datasets, RF can be computationally intensive as it may require a large number of trees [83]. However, computational time was not an issue in this research because the sample size was not large.
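The out-of-bag mechanism mentioned above can be sketched as follows. Each tree in a random forest is grown on a bootstrap sample of the rows, and the rows not drawn for that tree serve as its own validation set. The function name is hypothetical, and the sketch shows only the sampling step, not the tree fitting; the row count of 359 matches the sample size of this study.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw one bootstrap sample of row indices and return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
n_rows = 359   # number of children in the data set
n_trees = 200  # hypothetical forest size

oob_fractions = []
for _ in range(n_trees):
    in_bag, out_of_bag = bootstrap_split(n_rows, rng)
    oob_fractions.append(len(out_of_bag) / n_rows)

mean_oob_fraction = sum(oob_fractions) / n_trees
print(mean_oob_fraction)
```

On average a little over one third of the rows are left out of each tree, so every observation is predicted by many trees that never saw it, giving an internal estimate of prediction error without a separate validation set.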

Conclusions
This study has utilised MLR, ANN and RF to model the age at onset of T1D. The results indicate that the models developed for these data with MLR and RF outperform models using ANN, and that the best choice for the dependent variable is the logarithm of age at onset. The selected MLR and RF models can predict the age at onset with high values of R2 (0.88 and 0.89) and reasonably small values of RMSE (0.22 and 0.21) and MAE (0.18 and 0.17) for the age group of <15 years. The results also show that the selected RF model, with R2 of 0.78 and RMSE and MAE of 0.08 and 0.07 respectively, outperforms the other models for the (5-9) age group. The low performance of the ANN models is likely a result of the limited number of observations and may improve with a larger sample size. As the incidence rate of T1D in children in Saudi Arabia is increasing, it is important for studies to capture the diversity and culture of the population in Saudi Arabia, which are not currently represented in the key studies of T1D in children, most of which are from European populations. Hence, this study aimed to address the age at onset of T1D in children in Saudi Arabia utilizing local data and current statistical techniques. The outcomes of this study can help authorities to identify the risk factors influencing the age at onset of T1D in children and to take appropriate interventions to reduce the impact of delayed T1D diagnosis and potentially slow the incidence rate of T1D in children. This study highlights the need for a national database to further evaluate and improve the model for predictions. To further address the limitations of this study, other risk variables, such as maternal weight at childbirth, which has been suggested as a risk factor by Swedish research [27], should also be collected. This is important given that obesity among females in Saudi Arabia has increased over the past decade [85].
Furthermore, including a larger number of cities in the research would both improve the diversity and increase the sample size, which would consequently provide a more robust model for prediction. In addition, a unified electronic health record across all hospitals in the country would facilitate obtaining pregnancy variables and birth characteristics when a mother's pregnancy is followed up at hospitals other than the one where she gives birth.
Supporting information S1